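13 Classic Books and Papers on Distributed Architecture
Classic books
- Distributed Systems for Fun and Profit. Explains, in an accessible way, the core ideas behind distributed systems such as Amazon’s Dynamo and Google’s Bigtable and MapReduce.
- Designing Data-Intensive Applications. Uses many engineering case studies to explain, clearly and in depth, how to scale data nodes.
- Distributed Systems: Principles and Paradigms. Introduces seven core principles of distributed systems, with many examples.
- Scalable Web Architecture and Distributed Systems. Focuses on distributed systems that face the public internet.
- Principles of Distributed Systems. Covers a range of algorithms used in distributed systems.
Classic papers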
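Distributed transactions
- Transactions Across Datacenters (YouTube video). A talk given at Google I/O.
Paxos consensus algorithm
A message-passing consensus algorithm with a high degree of fault tolerance.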
- Bigtable: A Distributed Storage System for Structured Data
- The Chubby lock service for loosely-coupled distributed systems
- The Google File System
- MapReduce: Simplified Data Processing on Large Clusters
- Neat Algorithms - Paxos
Raft consensus algorithm
- In Search of an Understandable Consensus Algorithm (Extended Version)
- Raft - The Secret Lives of Data
- Raft Consensus Algorithm
- Raft Distributed Consensus Algorithm Visualization
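Gossip consistency algorithm
- Dynamo: Amazon’s Highly Available Key-value Store. Describes how Amazon’s Dynamo achieves high availability, scalability, and reliability.
- Time, Clocks, and the Ordering of Events in a Distributed System. Lamport’s paper on ordering events in a distributed system; it introduces logical clocks and also addresses clock synchronization.
- Lecture 10, Clock Synchronization, of the University of Massachusetts course Distributed Operating System. Covers the clock synchronization problem.
- Why Vector Clocks are Easy and Why Vector Clocks are Hard. Two articles on vector clocks.
- Efficient Reconciliation and Flow Control for Anti-Entropy Protocols. The original paper on the Gossip protocol used for data synchronization.
- Understanding Gossip (Cassandra Internals). The Gossip protocol is also the data protocol used in the NoSQL database Cassandra.
- Gossip Visualization. Diagrams illustrating Gossip.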
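As a quick orientation before reading the Lamport paper and the vector clock articles listed above, here is a minimal Python sketch of both ideas. It is an illustration under assumptions, not code from any of the listed sources: the names `LamportClock`, `vc_merge`, and `vc_happened_before` are hypothetical, and a vector clock is modelled as a plain dict from node name to counter.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Scalar logical clock (illustrative sketch, hypothetical names)."""
    time: int = 0

    def tick(self) -> int:
        # A local event advances the counter by one.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending is itself an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Move past the sender's timestamp, then count the receive event.
        self.time = max(self.time, msg_time) + 1
        return self.time


def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two vector clocks; missing entries count as 0."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def vc_happened_before(a: dict, b: dict) -> bool:
    """True if a <= b in every component and a < b in at least one."""
    nodes = a.keys() | b.keys()
    no_greater = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    some_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return no_greater and some_less
```

A scalar Lamport timestamp gives an ordering consistent with causality but cannot tell concurrent events apart. With vector clocks, two events are concurrent exactly when `vc_happened_before` is false in both directions.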
Distributed storage and databases
- Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
- Spanner: Google’s Globally-Distributed Database
- F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business
- Cassandra: A Decentralized Structured Storage System
- CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Distributed messaging systems
- Kafka: a Distributed Messaging System for Log Processing
- Wormhole: Reliable Pub-Sub to Support Geo-replicated Internet Services
- All Aboard the Databus! LinkedIn’s Scalable Consistent Change Data Capture Platform
Logs and data
- The Log: What every software engineer should know about real-time data’s unifying abstraction
- The Log-Structured Merge-Tree (LSM-Tree)
- Immutability Changes Everything
- Tango: Distributed Data Structures over a Shared Log
Distributed monitoring and tracing
- Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Data analytics
- The Unified Logging Infrastructure for Data Analytics at Twitter. A Twitter paper on logging architecture and data analytics.
- Scaling Big Data Mining Infrastructure: The Twitter Experience. Describes how Twitter built its data analytics platform.
- Dremel: Interactive Analysis of Web-Scale Datasets. Describes the architecture and implementation of Google’s Dremel and how it complements MapReduce.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Introduces the concept of the Resilient Distributed Dataset (RDD).
Programming-related papers
- Distributed Programming Model
- PSync: a partially synchronous language for fault-tolerant distributed algorithms
- Programming Models for Distributed Computing
- Logic and Lattices for Distributed Programming
Other distributed systems paper reading lists
- Services Engineering Reading List
- Readings in Distributed Systems
- Google Research - Distributed Systems and Parallel Computing